7 research outputs found

On PAC-Bayesian Bounds for Random Forests

Authors: Igel Christian; Lorenzen Stephan Sloth; Seldin Yevgeny
Publication date: 01/01/2019

Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning methods, are unsatisfying. We discuss and evaluate various PAC-Bayesian approaches to derive such bounds. The bounds do not require additional hold-out data, because the out-of-bag samples from the bagging in the training process can be exploited. A random forest predicts by taking a majority vote of an ensemble of decision trees. The first approach is to bound the error of the vote by twice the error of the corresponding Gibbs classifier (classifying with a single member of the ensemble selected at random). However, this approach does not take into account the averaging out of the errors of individual classifiers when taking the majority vote. This effect provides a significant boost in performance when the errors are independent or negatively correlated, but when the correlations are strong the advantage from taking the majority vote is small. The second approach, based on PAC-Bayesian C-bounds, takes dependencies between ensemble members into account, but it requires estimating correlations between the errors of the individual classifiers. When the correlations are high or the estimation is poor, the bounds degrade. In our experiments, we compute generalization bounds for random forests on various benchmark data sets. Because the individual decision trees already perform well, their predictions are highly correlated and the C-bounds do not lead to satisfactory results. For the same reason, the bounds based on the analysis of Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set, which come at the cost of a smaller training set, gave better performance guarantees but worse performance in most experiments.
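
The two approaches named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a uniform prior and posterior over trees (so the KL term vanishes), a PAC-Bayes-kl bound of the common form kl(empirical || true) <= ln(2*sqrt(n)/delta)/n, and the smallest out-of-bag set size as n. The function names, the example numbers, and `c_bound_plugin` (which only plugs empirical values into the C-bound formula and is therefore not itself a guarantee) are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out


def kl_inverse_upper(p_hat, budget, tol=1e-9):
    """Approximate the largest q >= p_hat with kl(p_hat || q) <= budget (bisection)."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) > budget:
            hi = mid
        else:
            lo = mid
    return hi  # hi errs on the conservative side


def first_order_bound(oob_errors, oob_sizes, delta=0.05):
    """Twice a PAC-Bayes-kl bound on the Gibbs error (uniform weights assumed)."""
    gibbs = sum(oob_errors) / len(oob_errors)  # empirical Gibbs error
    n = min(oob_sizes)                         # assumption: smallest OOB set
    budget = math.log(2 * math.sqrt(n) / delta) / n
    return min(1.0, 2 * kl_inverse_upper(gibbs, budget))


def c_bound_plugin(gibbs_error, disagreement):
    """C-bound formula 1 - (1 - 2R)^2 / (1 - 2d) with plug-in values."""
    if gibbs_error >= 0.5 or disagreement >= 0.5:
        return 1.0  # the formula is only meaningful below 1/2
    return max(0.0, 1 - (1 - 2 * gibbs_error) ** 2 / (1 - 2 * disagreement))


if __name__ == "__main__":
    # Hypothetical per-tree OOB error rates and OOB sample sizes.
    print(first_order_bound([0.10, 0.12, 0.08], [400, 350, 380]))
    # Low disagreement (highly correlated trees) versus high disagreement.
    print(c_bound_plugin(0.10, 0.05), c_bound_plugin(0.10, 0.18))
```

With a Gibbs error of 0.10, the plug-in C-bound value is larger than the factor-two value 0.20 when the trees rarely disagree (d = 0.05) and drops to zero when they disagree as much as independent errors would allow (d = 0.18), which matches the abstract's remark that strong correlations erase the advantage of the C-bound.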

arXiv.org e-Print Archive

Copenhagen University Research Information System

Information Bottleneck: Exact Analysis of (Quantized) Neural Networks

Authors: Igel Christian; Lorenzen Stephan Sloth; Nielsen Mads
Publication date: 24/06/2021

The information bottleneck (IB) principle has been suggested as a way to analyze deep neural networks. The learning dynamics are studied by inspecting the mutual information (MI) between the hidden layers and the input and output. Notably, separate fitting and compression phases during training have been reported. This led to some controversy, including claims that the observations are not reproducible and strongly dependent on the type of activation function used as well as on the way the MI is estimated. Our study confirms that different ways of binning when computing the MI lead to qualitatively different results, either supporting or refuting IB conjectures. To resolve the controversy, we study the IB principle in settings where MI is non-trivial and can be computed exactly. We monitor the dynamics of quantized neural networks, that is, we discretize the whole deep learning system so that no approximation is required when computing the MI. This allows us to quantify the information flow without measurement errors. In this setting, we observed a fitting phase for all layers and a compression phase for the output layer in all experiments; the compression in the hidden layers was dependent on the type of activation function. Our study shows that the initial IB results were not artifacts of binning when computing the MI. However, the critical claim that the compression phase may not be observed for some networks also holds true.
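
When every variable is discrete, as in the quantized setting this abstract describes, the MI over a finite data set can be computed exactly from co-occurrence counts with no binning step. The following Python sketch shows that calculation under the assumption that each layer activation has already been quantized and encoded as a hashable symbol (for example a tuple of integers). The function name and the toy inputs are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Exact MI in bits between two discrete sequences of equal length.

    The empirical joint distribution over the given samples is treated
    as the distribution of interest, so no estimator or binning is used.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical quantized hidden-layer codes (tuples) and class labels.
    hidden = [(0, 1), (0, 1), (1, 0), (1, 0)]
    labels = [0, 0, 1, 1]
    print(mutual_information(hidden, labels))
```

Tracking this quantity for each layer across training epochs would give the kind of fitting and compression curves the abstract refers to.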

arXiv.org e-Print Archive

Copenhagen University Research Information System

Learning from Educational Data: Improving Methods and Theoretical Guarantees for Data Mining

Author: Lorenzen Stephan Sloth
Publication venue: Department of Computer Science, Faculty of Science, University of Copenhagen
Publication date: 01/01/2019

Copenhagen University Research Information System

On predicting student performance using low-rank matrix factorization techniques

Authors: Alstrup Stephen; Lorenzen Stephan Sloth; Pham Dang Ninh
Publication venue: Academic Conferences and Publishing International - ACPIL
Publication date: 01/10/2017

Copenhagen University Research Information System

Steiner tree heuristics in Euclidean d-space

Authors: E. Olsen Andreas; Fonseca Rasmus; Lorenzen Stephan Sloth; Winter Pawel
Publication date: 01/01/2014

Copenhagen University Research Information System

Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote

Authors: Igel Christian; Lorenzen Stephan Sloth; Masegosa Andres; Seldin Yevgeny; Wu Yi-Shan
Publication date: 01/01/2021

Copenhagen University Research Information System

VBN

Using machine learning for predicting intensive care unit resource use during the COVID-19 pandemic in Denmark

Authors: Igel Christian; Jimenez-Solem Espen; Lorenzen Stephan Sloth; Nielsen Mads; Perner Anders; Petersen Tonny Studsgaard; Sillesen Martin; Thorsen-Meyer Hans-Christian
Publication venue: Springer Science and Business Media LLC
Publication date: 23/09/2021

The COVID-19 pandemic has put massive strains on hospitals, and tools to guide hospital planners in resource allocation during the ebbs and flows of the pandemic are urgently needed. We investigate whether machine learning (ML) can be used to predict intensive care requirements a fixed number of days into the future. In a retrospective design, health records from 42,526 SARS-CoV-2-positive patients in Denmark were extracted. Random Forest (RF) models were trained to predict risk of ICU admission and use of mechanical ventilation after n days (n = 1, 2, …, 15). An extended analysis was provided for n = 5 and n = 10. Models predicted n-day risk of ICU admission with an area under the receiver operating characteristic curve (ROC-AUC) between 0.981 and 0.995, and n-day risk of use of ventilation with an ROC-AUC between 0.982 and 0.997. The corresponding n-day forecasting models predicted the needed ICU capacity with a coefficient of determination (R²) between 0.334 and 0.989 and use of ventilation with an R² between 0.446 and 0.973. The forecasting models performed worst when forecasting many days into the future (for large n). For n = 5, ICU capacity was predicted with ROC-AUC 0.990 and R² 0.928, and use of ventilator was predicted with ROC-AUC 0.994 and R² 0.854. Random Forest-based modelling can be used for accurate n-day forecasting of ICU resource requirements when n is not too large.
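
For readers unfamiliar with the two metrics quoted in this abstract, the Python sketch below computes ROC-AUC (for per-patient risk scores) and the coefficient of determination R² (for capacity forecasts) from their standard definitions. It is an illustration only, not the study's evaluation code, and the function names and toy inputs are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive case is scored
    above a random negative case (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical labels (1 = admitted to ICU within n days) and risk scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    # Hypothetical daily ICU bed counts versus forecasts.
    print(r_squared([10, 12, 15, 11], [11, 12, 14, 10]))
```

The pairwise ROC-AUC computation is quadratic in the number of patients. For a cohort of the size reported in the abstract, a rank-based implementation from an established library would be the practical choice.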

Copenhagen University Research Information System

PubMed Central